Hive Metastore: Merge hive and avro schema if not consistent #55
funcheetah wants to merge 3 commits into linkedin:master from funcheetah:master
Conversation
    cols.addAll(getPartitionCols(table));
    }
    return convertFieldSchemaToAvroSchema(recordName, recordNamespace, true, cols);
Instead of passing cols, which is a List, can we create a struct field schema here and call HiveTypeToAvroType directly? That would let us get rid of convertFieldSchemaToAvroSchema.
I see this is a tail call, so we can just move the code from convertFieldSchemaToAvroSchema into convertHiveSchemaToAvro. Aside from that, what do you mean by "create a struct field schema"? I see the parseSchemaFromStruct method in HiveTypeToAvroType, but it is private; are you referring to that method?
    for (int i = 0; i < fieldNames.size(); ++i) {
      final TypeInfo fieldTypeInfo = fieldTypeInfos.get(i);
      String fieldName = fieldNames.get(i);
      fieldName = removePrefix(fieldName);
Do we need this? I think the field names being passed here are relative; they come from StructTypeInfo.getAllStructFieldNames(), so I don't think they are qualified from the root. A `.` in a field name here is probably part of the actual name of the field.
Yeah, I think it can be removed. Also, since we are dealing with Hive, I expect the field names won't contain `.` anyway.
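To illustrate the concern, here is a hypothetical sketch of what a prefix-stripping helper like removePrefix might do (this is an assumed reconstruction, not the PR's actual code). On the relative names returned by StructTypeInfo.getAllStructFieldNames() it is a no-op, while a name that genuinely contains a `.` would be mangled:

```java
// Hypothetical reconstruction of removePrefix: strip a leading "parent."
// qualifier from a field name. Since Hive struct field names are already
// simple (relative) names, this is at best a no-op and at worst lossy.
public class RemovePrefixSketch {
  public static String removePrefix(String fieldName) {
    int lastDot = fieldName.lastIndexOf('.');
    return lastDot < 0 ? fieldName : fieldName.substring(lastDot + 1);
  }

  public static void main(String[] args) {
    System.out.println(removePrefix("lastname")); // relative name: unchanged
    System.out.println(removePrefix("a.b"));      // dotted name: mangled to "b"
  }
}
```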
    // We don't cache the structType because otherwise it could be possible that a field
    // "lastname" is of type "firstname", where firstname is a compiled class.
    // This will lead to ambiguity.
I am not sure what this comment means. Which cache are we referring to?
    if (schemaStr != null) {
      schema = AvroSchemaUtil.toIceberg(new org.apache.avro.Schema.Parser().parse(schemaStr));
      org.apache.avro.Schema avroSchema = new org.apache.avro.Schema.Parser().parse(schemaStr);
      org.apache.avro.Schema hiveSchema = LegacyHiveSchemaUtils.convertHiveSchemaToAvro(table);
At this step during the conversion we pass mkFieldsOptional as true to make fields nullable, but in the very next line we remove nullability from the schema. Can we just mark fields as non-nullable to begin with and remove LegacyHiveSchemaUtils.extractActualTypeIfFieldIsNullableTypeRecord?
Good catch. I can change the signature to `convertHiveSchemaToAvro(Table table, boolean mkFieldsOptional)`, so that the function directly returns a non-nullable version of the schema.
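A minimal sketch of the proposed flag, with Avro types modeled as JSON type strings so the example stays self-contained (convertColumn is a hypothetical stand-in for the real per-column conversion, which would operate on org.apache.avro.Schema):

```java
public class OptionalFieldsSketch {
  // Hypothetical per-column conversion. With mkFieldsOptional=false the
  // result is already non-nullable, so no separate
  // extractActualTypeIfFieldIsNullableTypeRecord pass is needed afterwards.
  public static String convertColumn(String avroType, boolean mkFieldsOptional) {
    return mkFieldsOptional
        ? "[\"null\",\"" + avroType + "\"]" // nullable union, in Avro JSON form
        : "\"" + avroType + "\"";           // plain non-nullable type
  }
}
```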
    import org.slf4j.LoggerFactory;

    public class LegacyHiveSchemaUtils {
The code in this class is far too verbose. It could be heavily simplified by using visitors.
I feel the same, but that would require a non-trivial refactor of the code. We also want to publish this code soon, so there is a trade-off.
    import org.codehaus.jackson.node.JsonNodeFactory;

    public class HiveTypeToAvroType {
Will do later, but the integration test I ran already passed on all the tables.
    org.apache.avro.Schema tableSchema = avroSchema;
    boolean isHiveSchemaEvolved =
        LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable);
It seems isRecordSchemaEvolved has to traverse the whole schema tree. What do we gain by checking this first rather than just merging directly?
That is actually a good point. I think the two passes can be combined into one.
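A sketch of folding the isRecordSchemaEvolved check into the merge itself, so the schema tree is traversed once. Record schemas are modeled here as ordered name-to-type maps to keep the example self-contained; the real code would walk org.apache.avro.Schema objects the same way:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeInOnePassSketch {
  public static final class Result {
    public final Map<String, String> merged;
    public final boolean evolved; // true iff hive contributed any new field
    Result(Map<String, String> merged, boolean evolved) {
      this.merged = merged;
      this.evolved = evolved;
    }
  }

  // Merge hive fields into the avro schema while tracking, in the same pass,
  // whether the hive schema has evolved relative to the avro one.
  public static Result merge(Map<String, String> avro, Map<String, String> hive) {
    Map<String, String> merged = new LinkedHashMap<>(avro);
    boolean evolved = false;
    for (Map.Entry<String, String> e : hive.entrySet()) {
      if (!merged.containsKey(e.getKey())) { // hive-only field: schema evolved
        merged.put(e.getKey(), e.getValue());
        evolved = true;
      }
    }
    return new Result(merged, evolved);
  }
}
```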
wmoustafa left a comment
Still going through the patch. Currently there are many ways the conversions take place, and it seems they could be simplified. For example, we could go from the Hive type string to a Hive TypeInfo (one TypeInfo representing the whole schema) and then to an Avro schema.
    for (int i = 0; i < fieldNames.size(); ++i) {
      final TypeInfo fieldTypeInfo = fieldTypeInfos.get(i);
      String fieldName = fieldNames.get(i);
      fieldName = removePrefix(fieldName);

      schema = parseSchemaFromUnion((UnionTypeInfo) typeInfo, recordNamespace, recordName);
      break;
    default:
      throw new UnsupportedOperationException("Conversion from " + category + " not supported");
    // For example, in tracking.CommunicationRequestEvent.specificRequest,
    // PropGenerated and PropExternalCommunication have the same structure. In case of duplicate typeinfos, we generate
Best not to mention actual table and field names.
    final List<FieldSchema> cols = new ArrayList<>();
    cols.addAll(table.getSd().getCols());
There were concerns around using getSd().getCols(). Could you check whether we should use HiveMetastoreClient.getSchema() instead?
    for (TypeInfo ti : typeInfos) {
      Schema candidate;
      if (ti instanceof StructTypeInfo) {
        StructTypeInfo sti = (StructTypeInfo) ti;

    // a new record type for the duplicates.
    List<Schema> schemas = new ArrayList<>();
    for (TypeInfo ti : typeInfos) {
    private static final String SHORT_TYPE_NAME = "short";
    private static final String BYTE_TYPE_NAME = "byte";

    public HiveTypeToAvroType(String namespace, boolean mkFieldsOptional) {
Does it make sense to convert this to a utility class and move those parameters to convertFieldsTypeInfoToAvroSchema?
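A sketch of the suggested shape. The method name follows the PR, but the body here is a placeholder stub, not the real conversion:

```java
// Stateless utility version of HiveTypeToAvroType: namespace and
// mkFieldsOptional become method parameters instead of instance state.
public final class HiveTypeToAvroTypeSketch {
  private HiveTypeToAvroTypeSketch() { } // no instances

  // Placeholder return type and body; the real method would build an
  // org.apache.avro.Schema from the field TypeInfos.
  public static String convertFieldsTypeInfoToAvroSchema(
      String namespace, String recordName, boolean mkFieldsOptional) {
    return namespace + "." + recordName + (mkFieldsOptional ? " [optional fields]" : "");
  }
}
```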
    Schema convertFieldsTypeInfoToAvroSchema(String recordNamespace, String recordName, List<String> fieldNames,
        List<TypeInfo> fieldTypeInfos) {
Can we avoid using parallel lists of field names and field types throughout the PR? For example, instead of List<String> fieldNames and List<TypeInfo> fieldTypeInfos, we could just pass a StructTypeInfo. Usually the input is already a StructTypeInfo, which is then broken down into two lists and handled here. In cases where the original input comes as two lists, we can combine them using TypeInfoFactory.getStructTypeInfo() from Hive.
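A minimal stand-in illustrating the point. Hive's StructTypeInfo and TypeInfoFactory.getStructTypeInfo() play this role in practice; this sketch just shows why a single struct object is safer than two independently passed lists:

```java
import java.util.List;

// A struct value object keeps field names and field types aligned by
// construction, which two separately-passed parallel lists cannot guarantee.
public class StructSketch {
  public final List<String> fieldNames;
  public final List<String> fieldTypes;

  public StructSketch(List<String> fieldNames, List<String> fieldTypes) {
    if (fieldNames.size() != fieldTypes.size()) {
      throw new IllegalArgumentException("field names and types must align");
    }
    this.fieldNames = fieldNames;
    this.fieldTypes = fieldTypes;
  }
}
```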
    // We don't cache the structType because otherwise it could be possible that a field
    // "lastname" is of type "firstname", where firstname is a compiled class.
    // This will lead to ambiguity.
    schema = parseSchemaFromStruct((StructTypeInfo) typeInfo, recordNamespace, recordName);
Can we rename this and other methods to something like convertStructTypeInfoToAvroSchema?
No description provided.